Abstract

We derive oracle inequalities for the problems of isotonic and convex regression using the
combination of the Q-aggregation procedure and sparsity pattern aggregation. This improves
upon the previous results, including the oracle inequalities for the constrained least squares
estimator. One of the improvements is that our oracle inequalities are sharp, i.e., with
leading constant 1. This allows us to obtain bounds for the minimax regret, thus accounting
for model misspecification, which was not possible based on the previous results. Another
improvement is that we obtain oracle inequalities both with high probability and in expectation.


1. Introduction 

Assume that we have the observations

$$Y_i = \mu_i + \xi_i, \qquad i = 1, \dots, n, \tag{1.1}$$

where $\mu = (\mu_1, \dots, \mu_n)^T \in \mathbb{R}^n$ is unknown, $\xi = (\xi_1, \dots, \xi_n)^T$ is a noise vector with
$n$-dimensional Gaussian distribution $\mathcal{N}(0, \sigma^2 I_{n \times n})$ where $\sigma > 0$. We observe $y = (Y_1, \dots, Y_n)^T$
and we want to estimate $\mu$. We can interpret $\mu_i$ as the values $f(X_i)$ of an unknown regression
function $f : \mathcal{X} \to \mathbb{R}$ at given non-random points $X_i \in \mathcal{X}$, $i = 1, \dots, n$, where $\mathcal{X}$ is an
abstract set. Then, the equivalent setting is that we observe $y$ along with $(X_1, \dots, X_n)$ but
the values of $X_i$ are of no interest and can be replaced by their indices if we measure the
loss in a discrete norm. Namely, for any $u \in \mathbb{R}^n$ we consider the scaled (or empirical)
norm $\|\cdot\|$ defined by

$$\|u\|^2 = \frac{1}{n} \sum_{i=1}^{n} u_i^2. \tag{1.2}$$

We will measure the error of an estimator $\hat\mu$ of $\mu$ by the distance $\|\hat\mu - \mu\|$. Let $\mathcal{S}^\uparrow$ be the
set of all non-decreasing sequences:

$$\mathcal{S}^\uparrow := \{u = (u_1, \dots, u_n) \in \mathbb{R}^n : u_i \le u_{i+1},\ i = 1, \dots, n-1\}. \tag{1.3}$$


For a subset $S$ of $\mathcal{S}^\uparrow$, and any $\mu \in \mathbb{R}^n$, the quantity $\min_{u \in S} \|u - \mu\|$ is the smallest
approximation error achievable by a sequence in the set $S$. This quantity defines a benchmark or
oracle performance on $S$. The accuracy of an estimator $\hat\mu$, with respect to the oracle for any
$\mu$, not necessarily $\mu \in S$, can be characterized by the excess loss $\|\hat\mu - \mu\| - \min_{u \in S} \|u - \mu\|$.
This is a measure of performance of $\hat\mu$ under model misspecification. One can also consider
the expected quantities $R_1(\hat\mu, \mu) = \mathbb{E}_\mu \|\hat\mu - \mu\| - \min_{u \in S} \|u - \mu\|$ or
$R_2(\hat\mu, \mu) = \mathbb{E}_\mu \|\hat\mu - \mu\|^2 - \min_{u \in S} \|u - \mu\|^2$, known under the name of regret measures. Here, $\mathbb{E}_\mu$
denotes the expectation with respect to the distribution of $y$ satisfying (1.1). The minimax
regret is defined as $\min_{\hat\mu} \max_{\mu \in \mathbb{R}^n} R_i(\hat\mu, \mu)$ for $i = 1, 2$, where $\min_{\hat\mu}$ denotes the minimum
over all estimators. We can characterize the performance of an estimator $\hat\mu$ by the closeness
of its maximal regret $\max_{\mu \in \mathbb{R}^n} R_i(\hat\mu, \mu)$ to the minimax regret. This approach to measuring
the performance of estimators under model misspecification was pioneered by Vapnik and
Chervonenkis who called it the criterion of minimax of the loss (Vapnik and Chervonenkis,
1974, Chapter 6). In this paper, we follow this approach and establish non-asymptotic
bounds for the maximal regret for some classes $S$ of monotone and convex functions.

When the model is well-specified, i.e., the true function $\mu$ belongs to the class $S$, the
approximation error vanishes and instead of the minimax regret it is natural to consider
the minimax risk defined either as $\min_{\hat\mu} \max_{\mu \in S} \mathbb{E}_\mu \|\hat\mu - \mu\|$ or as $\min_{\hat\mu} \max_{\mu \in S} \mathbb{E}_\mu \|\hat\mu - \mu\|^2$
(the minimax squared risk). It is easy to see that the minimax risk is not greater than the
minimax regret. A classical problem in nonparametric statistics is to study the behavior
of minimax risks for different classes $S$. In particular, there exist results concerning the
minimax risks for classes of monotone and convex functions in our setting. We review some
of them below. The behavior of the minimax regret is much less studied. For a recent
overview and some general results we refer to Rakhlin et al. (2013) where it is shown that
the rate of minimax regret can be different from that of the minimax risk. Note that
Rakhlin et al. (2013) studies the prediction problem with i.i.d. observations, which is a
setting different from ours.

A well-studied estimator under the monotonicity and convexity assumptions is the least
squares estimator

$$\hat\mu^{LS}(S) \in \operatorname*{argmin}_{u \in S} \|y - u\|^2. \tag{1.4}$$

In Nemirovski et al. (1985) it was shown that $\hat\mu^{LS}(S)$ attains, up to logarithmic factors,
the rates $n^{-2/3}$ and $n^{-4/5}$ of the mean squared risk for classes $S$ of monotone and convex
functions respectively and that these rates are optimal up to logarithmic factors when the
minimax squared risk is used as a criterion. Under monotonicity constraints, the rate $n^{-2/3}$
was later observed in different settings, see for instance Banerjee and Wellner (2001);
Balabdaoui and Wellner (2007).
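
For the class of non-decreasing sequences, the minimizer in (1.4) can be computed exactly by the pool-adjacent-violators algorithm. This is a standard method that the present text does not discuss, so the following is only an illustrative sketch; the function name `isotonic_least_squares` is hypothetical.

```python
def isotonic_least_squares(y):
    """Least squares fit of a non-decreasing sequence to y (pool adjacent violators)."""
    blocks = []  # each block is [sum of its values, number of points]
    for value in y:
        blocks.append([float(value), 1])
        # Merge the last two blocks while their means violate monotonicity,
        # i.e. mean(previous) > mean(last), compared without dividing.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit


y = [1.0, 3.0, 2.0, 4.0, 3.5]
print(isotonic_least_squares(y))  # violating neighbours are replaced by their mean
```

Each merged block is replaced by its average, so the output is piecewise constant; the number of blocks plays the role of the quantity $k(u)$ that appears in the oracle inequalities below.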

One class of monotone functions we will be interested in here is defined as

$$\mathcal{S}^\uparrow(V) = \{\mu \in \mathcal{S}^\uparrow : V(\mu) \le V\}$$

where $V(\mu) = \mu_n - \mu_1$ for any $\mu = (\mu_1, \dots, \mu_n) \in \mathcal{S}^\uparrow$, and $V > 0$ is a given constant. In
Meyer and Woodroofe (2000); Zhang (2002) it was shown that for any $\mu \in \mathcal{S}^\uparrow$ we have

$$\mathbb{E}_\mu \|\hat\mu^{LS} - \mu\|^2 \le c \max\left( \left(\frac{\sigma^2 V(\mu)}{n}\right)^{2/3}, \frac{\sigma^2 \log n}{n} \right) \tag{1.5}$$

for $\hat\mu^{LS} = \hat\mu^{LS}(\mathcal{S}^\uparrow)$ and some absolute constant $c > 0$. This immediately implies an upper
bound on the minimax risk on $\mathcal{S}^\uparrow(V)$. A recent paper Chatterjee et al. (2015) establishes
the oracle inequality

$$\mathbb{E}_\mu \|\hat\mu^{LS} - \mu\|^2 \le C_* \min_{u \in \mathcal{S}^\uparrow} \left( \|\mu - u\|^2 + \frac{c_* \sigma^2 k(u)}{n} \log\frac{en}{k(u)} \right) \tag{1.6}$$

valid for all $\mu \in \mathcal{S}^\uparrow$ where either $C_* = 6$, $c_* = 1$ (Chatterjee et al., 2015, inequality (18))
or $C_* = 4$, $c_* = 4$ (Chatterjee et al., 2015, inequality (30)). Here, $k(u) \ge 1$ for $u =
(u_1, \dots, u_n) \in \mathcal{S}^\uparrow$ is the integer such that $k(u) - 1$ is the number of inequalities $u_i \le u_{i+1}$
that are strict for $i = 1, \dots, n-1$ (number of jumps of $u$). Inequality (1.6) implies (up to a
logarithmic factor) a bound as in (1.5) and also gives some more insight into the problem.
For example, (1.6) shows that the fast rate of order $(\sigma^2/n) \log n$ is achieved if $\mu$ has only one
jump or a fixed, independent of $n$, number of jumps. This is not granted by (1.5).

Along with the least squares estimator, one may consider estimation of monotone functions
via penalized least squares with total variation penalty. The corresponding estimator
is defined as

$$\hat\mu^{TV} \in \operatorname*{argmin}_{u \in \mathbb{R}^n} \left( \|u - y\|^2 + \lambda \sum_{i=1}^{n-1} |u_{i+1} - u_i| \right) \tag{1.7}$$

where $\lambda > 0$ is a tuning parameter. Statistical properties of this estimator were first studied
in Mammen and van de Geer (1997) where it was shown that $\|\hat\mu^{TV} - \mu\|$ attains the optimal
rate in probability on the class of functions of bounded variation (and thus on $\mathcal{S}^\uparrow(V)$).

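Recently, the performance of $\hat\mu^{TV}$ was analyzed in Dalalyan et al. (2014) by considering
$\hat\mu^{TV}$ as a special instance of the Lasso estimator. If $\mu^\uparrow$ is the projection of $\mu$ onto $\mathcal{S}^\uparrow$,
$\delta \in (0,1)$ is a constant, and the tuning parameter $\lambda$ is given by

$$\lambda = \sigma \sqrt{\frac{\log(n/\delta)}{k^* n}} \quad \text{where} \quad k^* = \left\lceil \left( \frac{V(\mu^\uparrow)^2\, n}{\sigma^2 \log(n/\delta)} \right)^{1/3} \right\rceil, \tag{1.8}$$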


the estimator $\hat\mu^{TV}$ satisfies with probability greater than $1 - 2\delta$ the following oracle
inequality (Dalalyan et al., 2014, Proposition 6):

$$\|\hat\mu^{TV} - \mu\|^2 \le \|\mu^\uparrow - \mu\|^2 + 6 \left( \frac{\sigma^2 V(\mu^\uparrow) \log(n/\delta)}{n} \right)^{2/3} + \frac{2\sigma^2 (1 + 2\log(1/\delta))}{n} \tag{1.9}$$

for all $\mu \in \mathbb{R}^n$. It follows from (1.9) that if the tuning parameter is chosen correctly, the
estimator $\hat\mu^{TV}$ achieves, up to a logarithmic factor, the minimax rate in probability
on the class $\mathcal{S}^\uparrow(V)$. Also, (1.9) implies a bound for the excess losses
$\|\hat\mu^{TV} - \mu\|^i - \min_{u \in \mathcal{S}^\uparrow(V)} \|u - \mu\|^i$, $i = 1, 2$, corresponding to the class $\mathcal{S}^\uparrow(V)$. However, (1.9) does not
allow us to evaluate the expected regrets $R_i(\hat\mu^{TV}, \mu)$ since $\hat\mu^{TV}$ depends on $\delta$. It is also
shown in (Dalalyan et al., 2014, Proposition 4) that if $\lambda = 2\sigma \sqrt{(2/n) \log(n/\delta)}$, the estimator
$\hat\mu^{TV}$ satisfies

$$\|\hat\mu^{TV} - \mu\|^2 \le \min_{u \in \mathbb{R}^n} \left( \|u - \mu\|^2 + \frac{4\sigma^2 k(u) \log(n/\delta)}{n}\, r_n(u) \right) \tag{1.10}$$

with probability greater than $1 - 2\delta$, where $k(u) - 1$ for $u \in \mathbb{R}^n$ is the number of jumps of
$u$, i.e., the cardinality of the set $\{i \in \{1, \dots, n-1\} : u_i \ne u_{i+1}\}$, $r_n(u) = 3 + 256(\log(n) +
(n/\Delta(u)))$ and $\Delta(u)$ is the minimum distance between two jumps in the sequence $u$:

$$\Delta(u) = \min\{d \ge 1 : \exists k \in \{1, \dots, n\} \text{ with } u_{k+1} \ne u_k \text{ and } u_{k+d+1} \ne u_{k+d}\}.$$


The expressions on the right hand sides of (1.6) and (1.10) are small if the unknown
sequence $\mu$ is well approximated by a piecewise constant sequence with not too many pieces.
In this regard, the two bounds have some similarity to sparsity oracle inequalities in
high-dimensional linear regression, cf. Rigollet and Tsybakov (2011, 2012); Tsybakov (2014).
This similarity can be easily explained as follows. Write (1.1) in the equivalent form

$$y = X\beta^* + \xi$$

with the matrix $X = (X_{ij})_{i,j=1,\dots,n}$ where $X_{ij} = 1$ if $j \le i$ and $X_{ij} = 0$ otherwise, and
$\beta^* = (\beta_1^*, \dots, \beta_n^*)$ where $\beta_1^* = \mu_1$ and $\beta_i^* = \mu_i - \mu_{i-1}$ for $i = 2, \dots, n$. With this notation,
$k(\mu) \in \{|\beta^*|_0, 1 + |\beta^*|_0\}$, where $|\beta^*|_0$ denotes the number of non-zero components of $\beta^*$. The
value $k(\mu)$ is small when $\beta^*$ is sparse. Thus, the problem of estimation of a piecewise constant
sequence $\mu$ with a small number of pieces can be considered as the problem of prediction in
sparse linear regression with a specific design matrix $X$. Similarly, we may write $u = X\beta$
for $\beta$ with components $\beta_1 = u_1$ and $\beta_i = u_i - u_{i-1}$ for $i = 2, \dots, n$. These remarks suggest
that we can apply the theory of sparsity oracle inequalities, in particular, sparsity pattern
aggregation (cf. Rigollet and Tsybakov (2011, 2012); Tsybakov (2014)) in the context of
monotone estimation described above. A similar observation is valid for estimation under
convexity constraints (see Section 3 below). In the present paper, we develop this argument
using as a building block the Q-aggregation procedures Rigollet (2012); Dai et al. (2012,
2014); Bellec (2014). In particular, we construct an estimator $\hat\mu$ such that


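$$\mathbb{E}_\mu \|\hat\mu - \mu\|^2 \le \min_{u \in \mathcal{S}^\uparrow} \left( \|\mu - u\|^2 + \frac{c\sigma^2 k(u)}{n} \log\frac{en}{k(u)} \right), \qquad \forall \mu \in \mathbb{R}^n, \tag{1.11}$$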


for some absolute constant $c > 0$. Note that (1.11) is a sharp oracle inequality (i.e., an
inequality with leading constant 1). It improves upon the oracle inequality (1.6) for the
least squares estimator where the leading constant $C_*$ is noticeably greater than 1 and the
bound is valid only for $\mu \in \mathcal{S}^\uparrow$. The advantage of having leading constant 1 and arbitrary
$\mu$ in (1.11) is that it allows us to derive bounds on the excess risk and on the minimax
regret, which was not possible based on the previous results. We also obtain sharp oracle
inequalities with high probability for the same estimator. In addition, we show that it
satisfies stronger sharp inequalities with the minimum $\min_{u \in \mathcal{S}^\uparrow}$ on the right hand side of
(1.11) replaced by $\min_{u \in \mathbb{R}^n}$. This implies that our results are invariant to the direction of
monotonicity; they remain valid if we replace everywhere monotone increasing by monotone
decreasing functions. Finally, we derive similar results for the problem of estimation under
the convexity constraints, improving an oracle inequality obtained in Guntuboyina and Sen
(2013).
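
The sparse linear regression reformulation of (1.1) described above can be checked numerically. The sketch below is illustrative only (the helper names are hypothetical): it builds the increments $\beta$ of a sequence, maps them back through the cumulative-sum design $X$, and compares $k(\mu)$ with the number of non-zero increments.

```python
def increments(u):
    """beta_1 = u_1 and beta_i = u_i - u_{i-1} for i >= 2."""
    return [u[0]] + [u[i] - u[i - 1] for i in range(1, len(u))]


def apply_design(beta):
    """Compute X beta for X_ij = 1 if j <= i and 0 otherwise (cumulative sums)."""
    out, running = [], 0.0
    for b in beta:
        running += b
        out.append(running)
    return out


def num_pieces(u):
    """k(u): one plus the number of indices i with u_i != u_{i+1}."""
    return 1 + sum(1 for i in range(len(u) - 1) if u[i] != u[i + 1])


mu = [0.0, 0.0, 1.0, 1.0, 1.0, 3.0]
beta = increments(mu)
nonzero = sum(1 for b in beta if b != 0)
print(beta, apply_design(beta), num_pieces(mu), nonzero)
```

When $\mu_1 = 0$, as in this example, $k(\mu) = 1 + |\beta^*|_0$; when $\mu_1 \ne 0$ the first coefficient is also non-zero and $k(\mu) = |\beta^*|_0$.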
2. Sparsity pattern aggregation for piecewise constant sequences

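For any non-empty set $J \subset \{1, \dots, n-1\}$, let $|J|$ denote the cardinality of $J$ and define

$$\pi_J := \frac{\exp(-|J|)}{H \binom{n-1}{|J|}}, \qquad H := \sum_{i=0}^{n-1} \exp(-i). \tag{2.1}$$

Let $P_J \in \mathbb{R}^{n \times n}$ be the projector on the linear subspace $V_J$ of $\mathbb{R}^n$ defined by

$$V_J := \{u \in \mathbb{R}^n : \forall i \in \{1, \dots, n-1\} \setminus J,\ u_{i+1} = u_i\}. \tag{2.2}$$

In words, $V_J$ is the space of all piecewise constant sequences that have jumps only at points
in $J$. Given a vector $y$ of observations and $\theta = (\theta_J)_{J \subset \{1, \dots, n-1\}}$ where each $\theta_J \in \mathbb{R}$, let

$$\hat\mu_\theta = \sum_{J \subset \{1, \dots, n-1\}} \theta_J P_J y. \tag{2.3}$$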
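
To make (2.2) and (2.3) concrete: the projection $P_J y$ replaces $y$ by its average on each maximal block of indices between consecutive points of $J$, and $\hat\mu_\theta$ is a weighted combination of such projections. The following is a small illustrative sketch, with hypothetical function names and 1-based jump positions stored in Python sets.

```python
def project(y, J):
    """P_J y: block averages of y, where blocks may only end at positions in J.

    J contains 1-based indices i in {1, ..., n-1} at which u_i != u_{i+1} is allowed.
    """
    n = len(y)
    out, start = [], 0
    for i in range(1, n + 1):
        if i == n or i in J:  # the current block ends after the i-th entry
            block = y[start:i]
            out.extend([sum(block) / len(block)] * len(block))
            start = i
    return out


def mixture(y, theta):
    """mu_theta = sum_J theta_J P_J y; theta maps frozenset J -> weight theta_J."""
    out = [0.0] * len(y)
    for J, weight in theta.items():
        out = [o + weight * p for o, p in zip(out, project(y, J))]
    return out


y = [1.0, 2.0, 6.0, 8.0]
theta = {frozenset({2}): 0.5, frozenset({1, 3}): 0.5}
print(project(y, {2}), project(y, {1, 3}), mixture(y, theta))
```

Only two sparsity patterns are used here; the estimator defined below aggregates over all subsets $J$, which is what makes the exact computation expensive.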


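Finally, let

$$\hat\mu^{Q} = \hat\mu_{\hat\theta}, \tag{2.4}$$

where $\hat\theta$ is the solution of the optimization problem

$$\min_{\theta \in \Lambda} \left( \|\hat\mu_\theta - y\|^2 + \sum_{J \subset \{1, \dots, n-1\}} \theta_J \left( \frac{2\sigma^2 |J|}{n} + \frac{1}{2} \|\hat\mu_\theta - P_J y\|^2 + \frac{46\sigma^2}{n} \log\frac{1}{\pi_J} \right) \right),$$

where

$$\Lambda = \Big\{ \theta : \theta_J \ge 0 \text{ for all } J \subset \{1, \dots, n-1\}, \text{ and } \sum_{J \subset \{1, \dots, n-1\}} \theta_J = 1 \Big\}.$$

This optimization problem is a convex quadratic program with a simplex constraint. It
performs aggregation of the linear estimators $(P_J y)_{J \subset \{1, \dots, n-1\}}$ using the Q-aggregation
procedure Dai et al. (2012, 2014); Bellec (2014) with the prior weights (2.1). As the size of
this quadratic program is of order $2^n$, it is a computationally hard problem. The estimator
$\hat\mu^Q$ satisfies the following sharp oracle inequalities.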


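Theorem 1 Let $\mu \in \mathbb{R}^n$, $n \ge 2$, and assume that the noise vector $\xi$ has distribution
$\mathcal{N}(0, \sigma^2 I_{n \times n})$. There exist absolute constants $c, c' > 0$ such that for all $\delta \in (0, 1/3)$, the
estimator $\hat\mu^Q$ satisfies with probability at least $1 - 3\delta$,

$$\|\hat\mu^Q - \mu\|^2 \le \min_{u \in \mathbb{R}^n} \left( \|\mu - u\|^2 + \frac{c\sigma^2 k(u)}{n} \log\frac{en}{k(u)} \right) + \frac{c\sigma^2 \log(1/\delta)}{n} \tag{2.5}$$

and

$$\mathbb{E}_\mu \|\hat\mu^Q - \mu\|^2 \le \min_{u \in \mathbb{R}^n} \left( \|\mu - u\|^2 + \frac{c'\sigma^2 k(u)}{n} \log\frac{en}{k(u)} \right). \tag{2.6}$$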
Proof Let $J \subset \{1, \dots, n-1\}$. Denote by $d = |J| + 1$ the dimension of the subspace $V_J$.
Then, the projection estimator $P_J y$ satisfies with probability at least $1 - \delta$ (see, for example,
Hsu et al. (2012)):

$$\|P_J y - \mu\|^2 \le \|P_J \mu - \mu\|^2 + \frac{\sigma^2 \big( d + 2\sqrt{d \log(1/\delta)} + 2\log(1/\delta) \big)}{n}
\le \min_{u \in V_J} \|\mu - u\|^2 + \frac{\sigma^2 \big( 2(|J| + 1) + 3\log(1/\delta) \big)}{n}. \tag{2.7}$$

The sharp oracle inequality from Bellec (2014) yields that with probability at least $1 - 2\delta$
for all $J \subset \{1, \dots, n-1\}$ we have

$$\|\hat\mu^Q - \mu\|^2 \le \|P_J y - \mu\|^2 + \frac{C\sigma^2}{n} \log\frac{1}{\pi_J} + \frac{C\sigma^2}{n} \log(1/\delta), \tag{2.8}$$

for some absolute constant $C > 0$. Combining (2.7) and (2.8) with the union bound and
the inequality (cf. (Rigollet and Tsybakov, 2012, (5.4))) $\log(1/\pi_J) \le 2(|J| + 1) \log(en/(|J| + 1)) + 1/2$,
we find that with probability at least $1 - 3\delta$,

$$\|\hat\mu^Q - \mu\|^2 \le \min_{J \subset \{1, \dots, n-1\}} \min_{u \in V_J} \left( \|\mu - u\|^2 + \frac{c\sigma^2 (|J| + 1)}{n} \log\frac{en}{|J| + 1} \right) + \frac{c\sigma^2 \log(1/\delta)}{n}$$

where $c > 0$ is an absolute constant. Since $|J| + 1 = k(u)$ when $J$ is the set of jumps of $u \in V_J$, and
$\min_{J \subset \{1, \dots, n-1\}} \min_{u \in V_J} = \min_{u \in \mathbb{R}^n}$, the bound (2.5) follows. Finally, (2.6) is obtained from (2.5) by integration. ■
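
Two facts about the prior weights (2.1) enter the argument above: the weights form a probability distribution over the subsets $J$, and $\log(1/\pi_J)$ is controlled by $2(|J|+1)\log(en/(|J|+1)) + 1/2$. The sketch below checks both numerically for small $n$. It is only a sanity check under one assumption: the sum runs over all subsets of $\{1,\dots,n-1\}$, the empty set included, which is what the normalization $H$ corresponds to. Function names are hypothetical.

```python
from math import comb, exp, log


def log_inverse_prior(size, n):
    """log(1/pi_J) for a set J with |J| = size, with pi_J as in (2.1)."""
    H = sum(exp(-i) for i in range(n))  # i = 0, ..., n-1
    return size + log(H) + log(comb(n - 1, size))


def check_prior(n):
    """Return the total prior mass; assert the bound on log(1/pi_J) for every size."""
    total = 0.0
    for size in range(n):  # |J| ranges over 0, ..., n-1
        lip = log_inverse_prior(size, n)
        bound = 2 * (size + 1) * log(exp(1) * n / (size + 1)) + 0.5
        assert lip <= bound, (n, size)
        # There are comb(n-1, size) sets of this size, each with weight exp(-lip).
        total += comb(n - 1, size) * exp(-lip)
    return total


for n in (2, 5, 20):
    print(n, check_prior(n))
```

The weights depend on $J$ only through $|J|$, which is why the check can loop over sizes rather than over all $2^{n-1}$ subsets.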


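We now discuss some corollaries of Theorem 1. First, it follows that (1.11) is satisfied for
$\hat\mu = \hat\mu^Q$, so the remarks after (1.11) apply. Next, in view of (2.6), for the class of monotone
sequences with at most $k$ constant pieces (that is, at most $k - 1$ jumps),
$\mathcal{S}_k^\uparrow = \{u \in \mathcal{S}^\uparrow : k(u) \le k\}$, we have the following bounds
for the maximal expected regrets

$$\max_{\mu \in \mathbb{R}^n} \left( \mathbb{E}_\mu \|\hat\mu^Q - \mu\| - \min_{u \in \mathcal{S}_k^\uparrow} \|u - \mu\| \right) \le c\sigma \sqrt{\frac{k}{n} \log\frac{en}{k}}, \tag{2.9}$$

$$\max_{\mu \in \mathbb{R}^n} \left( \mathbb{E}_\mu \|\hat\mu^Q - \mu\|^2 - \min_{u \in \mathcal{S}_k^\uparrow} \|u - \mu\|^2 \right) \le \frac{c\sigma^2 k}{n} \log\frac{en}{k}, \tag{2.10}$$

where $c > 0$ is an absolute constant. The same bounds hold for the minimax risks over $\mathcal{S}_k^\uparrow$,
since the minimax risk is smaller than the minimax regret. Proposition 4 below shows that
the bounds (2.9) and (2.10) are optimal up to logarithmic factors.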
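
For completeness, here is a short sketch of how the bound (2.9) on the non-squared regret can be deduced from (2.6). The shorthand $r_k$ is introduced only for this derivation, and $c'$ is the constant of (2.6). For any $u \in \mathcal{S}_k^\uparrow$ we have $k(u) \le k$, and $x \mapsto x\log(en/x)$ is increasing on $[1, n]$, so that:

```latex
\begin{aligned}
r_k &:= \frac{c'\sigma^2 k}{n}\log\frac{en}{k}, \\
\mathbb{E}_\mu\|\hat\mu^Q-\mu\|
  &\le \big(\mathbb{E}_\mu\|\hat\mu^Q-\mu\|^2\big)^{1/2}
  && \text{(Jensen's inequality)} \\
  &\le \big(\|u-\mu\|^2 + r_k\big)^{1/2}
  && \text{(by (2.6) and } k(u)\le k\text{)} \\
  &\le \|u-\mu\| + \sqrt{r_k}
  && \text{(since } \sqrt{a+b}\le\sqrt a+\sqrt b\text{)}.
\end{aligned}
```

Taking the minimum over $u \in \mathcal{S}_k^\uparrow$ and then the maximum over $\mu \in \mathbb{R}^n$ gives (2.9) with $c = \sqrt{c'}$.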

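Finally, consider the consequences of Theorem 1 for the class $\mathcal{S}^\uparrow(V)$. To this end, define
the integer $k^*$ such that

$$k^* = \min\left\{ m \in \mathbb{N} : m \ge \left( \frac{V(\mu)^2\, n}{\sigma^2 \log(en)} \right)^{1/3} \right\}$$

if the set $\big\{ m \in \mathbb{N} : m \ge \big( V(\mu)^2 n / (\sigma^2 \log(en)) \big)^{1/3} \big\}$ is non-empty, and $k^* = 1$ otherwise. We will need
the following lemma.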
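Lemma 2 Let $\mu \in \mathcal{S}^\uparrow$ and let $1 \le k \le n$ be an integer. Then there exists a sequence
$u \in \mathcal{S}_k^\uparrow$ such that

$$\|u - \mu\| \le \frac{V(\mu)}{2k}. \tag{2.11}$$

Next, there exists a sequence $u \in \mathcal{S}_{k^*}^\uparrow$ such that

$$\|u - \mu\|^2 \le \frac{1}{4} \max\left( \left( \frac{\sigma^2 V(\mu) \log(en)}{n} \right)^{2/3}, \frac{\sigma^2 \log(en)}{n} \right). \tag{2.12}$$

In addition,

$$\frac{\sigma^2 k^*}{n} \log\frac{en}{k^*} \le 2 \max\left( \left( \frac{\sigma^2 V(\mu) \log(en)}{n} \right)^{2/3}, \frac{\sigma^2 \log(en)}{n} \right). \tag{2.13}$$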

Proof To construct the sequence $u$, consider the $k$ intervals

$$I_j = \left[ \mu_1 + \frac{j-1}{k} V(\mu),\ \mu_1 + \frac{j}{k} V(\mu) \right), \qquad j = 1, \dots, k-1, \tag{2.14}$$

and $I_k = [\mu_1 + \frac{k-1}{k} V(\mu), \mu_n]$. For all $j = 1, \dots, k$, let

$$J_j = \{i = 1, \dots, n : \mu_i \in I_j\}. \tag{2.15}$$

For any $i \in \{1, \dots, n\}$ there exists a unique $j \in \{1, \dots, k\}$ such that $i \in J_j$. Let
$u_i = \mu_1 + \frac{2j-1}{2k} V(\mu)$ for all $i \in J_j$. Then the sequence $u = (u_1, \dots, u_n)$ is non-decreasing, it
has at most $k$ pieces, i.e., $k(u) \le k$, and $|u_i - \mu_i| \le V(\mu)/(2k)$ for $i = 1, \dots, n$. Thus (2.11)
follows. Next, note that if $k^* = 1$, then $V(\mu)^2 \le \sigma^2 \log(en)/n$. If $k^* > 1$, then by definition
of $k^*$, $V(\mu)^2/(k^*)^2 \le (\sigma^2 V(\mu) \log(en)/n)^{2/3}$. Thus, (2.12) follows from (2.11) applied with $k = k^*$.
The bound (2.13) is straightforward by studying the cases $k^* = 1$ and $k^* > 1$ separately. ■
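
The construction in the proof of Lemma 2 is a quantization of $\mu$ onto the midpoints of $k$ equal-width intervals covering $[\mu_1, \mu_n]$. A minimal illustrative sketch follows (hypothetical function names; interval indices are 0-based in the code), together with a check of (2.11) in the scaled norm (1.2).

```python
def quantize_monotone(mu, k):
    """Approximate a non-decreasing mu by at most k constant pieces (interval midpoints)."""
    V = mu[-1] - mu[0]  # total variation V(mu) of a non-decreasing sequence
    if V == 0:
        return list(mu)
    u = []
    for x in mu:
        # 0-based index of the interval containing x; the last interval is closed.
        j = min(int((x - mu[0]) * k / V), k - 1)
        u.append(mu[0] + (2 * j + 1) * V / (2 * k))
    return u


def scaled_norm(u):
    """The norm (1.2): square root of the average of squared entries."""
    return (sum(x * x for x in u) / len(u)) ** 0.5


mu = [0.0, 0.1, 0.4, 0.45, 0.9, 1.0]
k = 2
u = quantize_monotone(mu, k)
error = scaled_norm([a - b for a, b in zip(u, mu)])
print(u, error, (mu[-1] - mu[0]) / (2 * k))
```

Increasing $k$ reduces the approximation error at rate $1/k$ while the stochastic term in (2.6) grows linearly in $k$ up to a logarithm; the value $k^*$ balances the two.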


We can now derive the following corollary of Theorem 1. 


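Corollary 3 Under the assumptions of Theorem 1, there exists an absolute constant $c > 0$
such that, for any $\mu \in \mathcal{S}^\uparrow$,

$$\mathbb{E}_\mu \|\hat\mu^Q - \mu\|^2 \le c \max\left( \left( \frac{\sigma^2 V(\mu) \log n}{n} \right)^{2/3}, \frac{\sigma^2 \log n}{n} \right). \tag{2.16}$$

In addition, for any $V > 0$ and any $\mu \in \mathbb{R}^n$ the expected regret of $\hat\mu^Q$ satisfies

$$\mathbb{E}_\mu \|\hat\mu^Q - \mu\| - \min_{u \in \mathcal{S}^\uparrow(V)} \|u - \mu\| \le c \max\left( \left( \frac{\sigma^2 V \log n}{n} \right)^{1/3}, \sigma \sqrt{\frac{\log n}{n}} \right), \tag{2.17}$$

where $c > 0$ is an absolute constant.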
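Proof Inequality (2.16) is straightforward in view of (2.6), (2.12), and (2.13). To prove
(2.17), fix any $\mu \in \mathbb{R}^n$ and consider

$$\mu^* \in \operatorname*{argmin}_{u \in \mathcal{S}^\uparrow(V)} \|\mu - u\|.$$

From (2.6) and the fact that the function $x \mapsto x \log\left(\frac{en}{x}\right)$ is increasing for $1 \le x \le n$ we get

$$\begin{aligned}
\mathbb{E}_\mu \|\hat\mu^Q - \mu\| &\le \min_{u \in \mathcal{S}_{k^*}^\uparrow} \left( \|u - \mu\| + \sqrt{\frac{c'\sigma^2 k^*}{n} \log\frac{en}{k^*}} \right) \\
&\le \min_{u \in \mathcal{S}_{k^*}^\uparrow} \|u - \mu^*\| + \|\mu^* - \mu\| + \sqrt{\frac{c'\sigma^2 k^*}{n} \log\frac{en}{k^*}} \\
&\le \|\mu^* - \mu\| + c'' \max\left( \left( \frac{\sigma^2 V \log n}{n} \right)^{1/3}, \sigma \sqrt{\frac{\log n}{n}} \right)
\end{aligned}$$

for an absolute constant $c'' > 0$, where the last inequality follows from (2.12) and (2.13) applied to $\mu^*$,
together with $V(\mu^*) \le V$.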


The estimator $\hat\mu^Q$ shown in Theorem 1 satisfies the sharp oracle inequalities both in
expectation and with high probability. Previous results for the least squares estimator
Chatterjee et al. (2015) were only obtained in expectation and the results on the $\ell_1$-penalized
estimator (1.7) are only obtained with high probability.

Finally, the following result shows that the upper bounds (2.9) and (2.10) are optimal 
up to logarithmic factors. 

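Proposition 4 Let $n \ge 2$, $V > 0$ and $\sigma > 0$. There exist absolute constants $c, c' > 0$ such
that for any positive integer $k \le n$ satisfying $k^3 \le 16nV^2/\sigma^2$ we have

$$\inf_{\hat\mu} \sup_{\mu \in \mathcal{S}_k^\uparrow \cap \mathcal{S}^\uparrow(V)} P_\mu\left( \|\hat\mu - \mu\|^2 \ge \frac{c\sigma^2 k}{n} \right) > c', \tag{2.18}$$

where $P_\mu$ denotes the distribution of $y$ satisfying (1.1) and $\inf_{\hat\mu}$ is the infimum over all
estimators.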


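For $k = 1, \dots, n$, take any $V > 0$ large enough to satisfy $k^3 \le 16nV^2/\sigma^2$. Then, Proposition 4
and Markov's inequality yield the following lower bounds on the minimax risks over the
class $\mathcal{S}_k^\uparrow$:

$$\inf_{\hat\mu} \sup_{\mu \in \mathcal{S}_k^\uparrow} \mathbb{E}_\mu \|\hat\mu - \mu\| \ge c' \sqrt{c}\, \sigma \sqrt{\frac{k}{n}}, \qquad
\inf_{\hat\mu} \sup_{\mu \in \mathcal{S}_k^\uparrow} \mathbb{E}_\mu \|\hat\mu - \mu\|^2 \ge c c' \frac{\sigma^2 k}{n}. \tag{2.19}$$

As the minimax risk is smaller than the minimax regret, (2.19) also provides lower bounds
for the corresponding minimax regrets over $\mathcal{S}_k^\uparrow$. Combining this with (2.9) and (2.10) we
find that the estimator $\hat\mu^Q$ achieves up to logarithmic factors the optimal rate with respect
to the minimax regret.

Next, Proposition 4 implies the following lower bound on the minimax deviation risk on
$\mathcal{S}^\uparrow(V)$.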
Corollary 5 Let $n \ge 2$, $V > 0$ and $\sigma > 0$. There exist absolute constants $c, c' > 0$ such that

$$\inf_{\hat\mu} \sup_{\mu \in \mathcal{S}^\uparrow(V)} P_\mu\left( \|\hat\mu - \mu\|^2 \ge c \max\left( \left( \frac{\sigma^2 V}{n} \right)^{2/3}, \frac{\sigma^2}{n} \right) \right) > c'. \tag{2.20}$$

To prove this corollary it is enough to note that if $16nV^2/\sigma^2 \ge 1$, by choosing $k$ in
Proposition 4 as the integer part of $(16nV^2/\sigma^2)^{1/3}$, we obtain the lower bound corresponding
to $(\sigma^2 V/n)^{2/3}$ under the maximum in (2.20). On the other hand, if $16nV^2/\sigma^2 < 1$ the term
$\sigma^2/n$ is dominant, so that we need to have the lower bound of the order $\sigma^2/n$, which is trivial
(it follows from a reduction to the bound for the class composed of two constant functions).

It follows from (2.20) and (2.16) that the estimator $\hat\mu^Q$ achieves, up to logarithmic
factors, the optimal rate with respect to the minimax risk on the class $\mathcal{S}^\uparrow(V)$. Using (2.17)
and the fact that the minimax risk is smaller than the minimax regret, we conclude that it
is also the optimal rate up to logarithmic factors for the minimax regret.

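Proof [Proof of Proposition 4] We assume for simplicity that $n$ is a multiple of $k$. The general
case is treated analogously. For any $\omega, \omega' \in \{0,1\}^k$, let $d_H(\omega, \omega') = |\{j = 1, \dots, k : \omega_j \ne \omega'_j\}|$
be the Hamming distance between $\omega$ and $\omega'$. By the Varshamov-Gilbert bound (Tsybakov,
2009, Lemma 2.9), there exists a set $\Omega \subset \{0,1\}^k$ such that

$$0 = (0, \dots, 0) \in \Omega, \qquad \log(|\Omega| - 1) \ge k/8, \qquad \text{and} \qquad d_H(\omega, \omega') \ge k/8 \tag{2.21}$$

for any two distinct $\omega, \omega' \in \Omega$. For each $\omega \in \Omega$, define a vector $u^\omega \in \mathbb{R}^n$ with components

$$u_i^\omega = \left\lfloor \frac{(i-1)k}{n} \right\rfloor \frac{V}{2k} + \gamma\, \omega_{\lfloor (i-1)k/n \rfloor + 1}, \qquad i = 1, \dots, n,$$

where $\gamma = (1/8)\sqrt{\sigma^2 k/n}$, and $\lfloor x \rfloor$ denotes the largest integer not exceeding $x$. For any
$\omega \in \Omega$, $u^\omega$ is a piecewise constant sequence with $k(u^\omega) \le k$, $u^\omega$ is a non-decreasing
sequence because $\gamma \le V/(2k)$, and by construction $V(u^\omega) \le V$. Thus, $u^\omega \in \mathcal{S}_k^\uparrow \cap \mathcal{S}^\uparrow(V)$ for
all $\omega \in \Omega$. Moreover, for any two distinct $\omega, \omega' \in \Omega$,

$$\|u^\omega - u^{\omega'}\|^2 = \frac{\gamma^2}{k} d_H(\omega, \omega') \ge \frac{\gamma^2}{8} = \frac{\sigma^2 k}{512 n}. \tag{2.22}$$

Set for brevity $P_\omega = P_{u^\omega}$. The Kullback-Leibler divergence $K(P_\omega, P_{\omega'})$ between $P_\omega$ and
$P_{\omega'}$ is equal to $\frac{n}{2\sigma^2} \|u^\omega - u^{\omega'}\|^2$ for all $\omega, \omega' \in \Omega$. Thus,

$$K(P_\omega, P_0) = \frac{\gamma^2 n\, d_H(0, \omega)}{2k\sigma^2} \le \frac{k}{128} \le \frac{\log(|\Omega| - 1)}{16}. \tag{2.23}$$

Applying (Tsybakov, 2009, Theorem 2.7) with $\alpha = 1/16$ completes the proof. ■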
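
The hypotheses $u^\omega$ used in the proof above are easy to generate, which gives a quick way to check the monotonicity, the variation constraint and the distance identity (2.22) on an example. The sketch below is illustrative only: the function names are hypothetical, the values of $n$, $k$, $V$, $\sigma$ are arbitrary choices satisfying $k^3 \le 16nV^2/\sigma^2$, and the two binary vectors are picked by hand rather than taken from a Varshamov-Gilbert set.

```python
from math import sqrt


def hypothesis(omega, n, V, sigma):
    """Build u^omega: a staircase with steps V/(2k), perturbed by gamma * omega."""
    k = len(omega)
    gamma = sqrt(sigma ** 2 * k / n) / 8
    u = []
    for i in range(1, n + 1):
        j = (i - 1) * k // n  # floor((i-1)k/n), a value in {0, ..., k-1}
        u.append(j * V / (2 * k) + gamma * omega[j])
    return u


def scaled_sq_dist(u, v):
    """Squared distance in the scaled norm (1.2)."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


n, V, sigma = 12, 1.0, 1.0
omega, omega_prime = (1, 0, 1), (0, 0, 0)
k = len(omega)
gamma = sqrt(sigma ** 2 * k / n) / 8
hamming = sum(a != b for a, b in zip(omega, omega_prime))
u, v = hypothesis(omega, n, V, sigma), hypothesis(omega_prime, n, V, sigma)
print(scaled_sq_dist(u, v), gamma ** 2 * hamming / k)
```

The two printed numbers agree up to rounding, reflecting that each differing coordinate of $\omega$ contributes $n/k$ entries of size $\gamma$ to the difference.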


3. Estimation of convex sequences by aggregation 

Assume that $n \ge 3$ and define the set of convex sequences as follows:

$$\mathcal{S}^{C} = \{u = (u_1, \dots, u_n) \in \mathbb{R}^n : 2u_i \le u_{i+1} + u_{i-1},\ i = 2, \dots, n-1\}. \tag{3.1}$$

For any $u \in \mathbb{R}^n$, we introduce the integer $q(u) \ge 1$ such that $q(u) - 1$ is the cardinality
of the set $\{i = 2, \dots, n-1 : 2u_i \ne u_{i+1} + u_{i-1}\}$. If $u \in \mathcal{S}^{C}$, $q(u) - 1$ is the number of
inequalities $2u_i \le u_{i+1} + u_{i-1}$ that are strict for $i = 2, \dots, n-1$. The value $q(u)$ is small if
$u$ is a piecewise linear sequence with a small number of pieces.
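
As a small illustration of these definitions, the convexity constraint in (3.1) and the quantity $q(u)$ can both be read off the second differences of $u$. The helper names below are hypothetical, and exact comparisons are used because the example entries are integers.

```python
def is_convex(u):
    """Check 2 u_i <= u_{i+1} + u_{i-1} at every interior index."""
    return all(2 * u[i] <= u[i + 1] + u[i - 1] for i in range(1, len(u) - 1))


def num_linear_pieces(u):
    """q(u): one plus the number of interior indices with 2 u_i != u_{i+1} + u_{i-1}."""
    return 1 + sum(1 for i in range(1, len(u) - 1) if 2 * u[i] != u[i + 1] + u[i - 1])


u = [4, 2, 0, 1, 2, 5]  # slopes -2, -2, 1, 1, 3: two changes of slope
print(is_convex(u), num_linear_pieces(u))
```

An affine sequence has $q(u) = 1$, in the same way that a constant sequence has $k(u) = 1$ in the monotone case.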

The performance of the least squares estimator $\hat\mu^{LS}(\mathcal{S}^{C})$ over convex sequences has been
recently studied in Guntuboyina and Sen (2013). If the unknown vector $\mu$ belongs to the
set $\mathcal{S}^{C}$, Guntuboyina and Sen (2013) shows that the estimator $\hat\mu^{LS}(\mathcal{S}^{C})$ satisfies a risk
bound that depends on $\mu$ through the quantity $R(\mu) = \max(1, \min\{\|\tau - \mu\|^2 : \tau \text{ is affine}\})$ and
involves an absolute constant $c > 0$. It is proved in (Chatterjee et al., 2015, Example 2.3) that
the least squares estimator $\hat\mu^{LS}(\mathcal{S}^{C})$ satisfies the oracle inequality

$$\mathbb{E}_\mu \|\hat\mu^{LS}(\mathcal{S}^{C}) - \mu\|^2 \le 6 \min_{u \in \mathcal{S}^{C}} \left( \|\mu - u\|^2 + \frac{c\sigma^2 q(u)}{n} \left( \log\frac{en}{q(u)} \right)^{5/4} \right) \tag{3.2}$$

where $c > 0$ is an absolute constant. The right hand side of (3.2) is small if the unknown
vector $\mu$ can be well approximated by a piecewise linear sequence in $\mathcal{S}^{C}$ with not too many
pieces.

The leading constant in (3.2) is 6. We will show that sparsity pattern aggregation
achieves a substantially better performance. We obtain the sharp oracle inequality (3.5)
below, improving upon (3.2) not only in the fact that the leading constant is 1 but also in
the rate of the remainder term; we will see that the exponent 5/4 of the logarithmic factor
is reduced to 1.

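For any set $J \subset \{2, \dots, n-1\}$, define

$$\nu_J := \frac{\exp(-|J|)}{H_C \binom{n-2}{|J|}}, \qquad H_C := \sum_{i=0}^{n-2} \exp(-i). \tag{3.3}$$

Let $Q_J \in \mathbb{R}^{n \times n}$ be the projector on the linear subspace $W_J$ of $\mathbb{R}^n$ given by

$$W_J := \{u \in \mathbb{R}^n : \forall i \in \{2, \dots, n-1\} \setminus J,\ 2u_i = u_{i+1} + u_{i-1}\}.$$

Given a vector $y$ of observations and $\theta = (\theta_J)_{J \subset \{2, \dots, n-1\}}$ where each $\theta_J$ belongs to $\mathbb{R}$, let

$$\hat\mu_\theta = \sum_{J \subset \{2, \dots, n-1\}} \theta_J Q_J y.$$

Finally, let

$$\hat\mu^{Q\text{-conv}} = \hat\mu_{\hat\theta},$$

where $\hat\theta$ is the solution of the optimization problem

$$\min_{\theta} \left( \|\hat\mu_\theta - y\|^2 + \sum_{J \subset \{2, \dots, n-1\}} \theta_J \left( \frac{2\sigma^2 |J|}{n} + \frac{1}{2} \|\hat\mu_\theta - Q_J y\|^2 + \frac{46\sigma^2}{n} \log\frac{1}{\nu_J} \right) \right),$$

where the minimum is taken over the simplex

$$\Big\{ \theta : \theta_J \ge 0 \text{ for all } J \subset \{2, \dots, n-1\}, \text{ and } \sum_{J \subset \{2, \dots, n-1\}} \theta_J = 1 \Big\}.$$

The structure of this minimization problem is the same as that of its analog introduced in Section
2. This is a quadratic program that aggregates the linear estimators $(Q_J y)_{J \subset \{2, \dots, n-1\}}$ using
the Q-aggregation procedure Dai et al. (2012, 2014); Bellec (2014) with the prior weights
(3.3).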



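Theorem 6 Let $\mu \in \mathbb{R}^n$, $n \ge 3$, and assume that the noise vector $\xi$ has distribution
$\mathcal{N}(0, \sigma^2 I_{n \times n})$. There exist absolute constants $c, c' > 0$ such that for all $\delta \in (0, 1/3)$, the
estimator $\hat\mu^{Q\text{-conv}}$ satisfies with probability at least $1 - 3\delta$,

$$\|\hat\mu^{Q\text{-conv}} - \mu\|^2 \le \min_{u \in \mathbb{R}^n} \left( \|\mu - u\|^2 + \frac{c\sigma^2 q(u)}{n} \log\frac{en}{q(u)} \right) + \frac{c\sigma^2 \log(1/\delta)}{n}, \tag{3.4}$$

and we have

$$\mathbb{E}_\mu \|\hat\mu^{Q\text{-conv}} - \mu\|^2 \le \min_{u \in \mathbb{R}^n} \left( \|\mu - u\|^2 + \frac{c'\sigma^2 q(u)}{n} \log\frac{en}{q(u)} \right). \tag{3.5}$$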


The proof of this theorem is the same as that of Theorem 1 with the only difference that
$J$ is now a subset of $\{2, \dots, n-1\}$ rather than that of $\{1, \dots, n-1\}$, and we replace the
notation $P_J$ and $V_J$ by $Q_J$ and $W_J$ respectively.

The leading constant of the oracle inequality (3.5) is 1, and the remainder term is
proportional to $q(u) \log(en/q(u))$. These are two improvements upon (3.2), where the
leading constant is 6 and the remainder term is proportional to $q(u) (\log(en/q(u)))^{5/4}$.

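In view of (3.5), for the class of piecewise linear convex sequences with at most $q$ linear
pieces, $\mathcal{S}_q^{C} = \{u \in \mathcal{S}^{C} : q(u) \le q\}$, we have the following bounds for the maximal expected
regrets

$$\max_{\mu \in \mathbb{R}^n} \left( \mathbb{E}_\mu \|\hat\mu^{Q\text{-conv}} - \mu\| - \min_{u \in \mathcal{S}_q^{C}} \|u - \mu\| \right) \le c\sigma \sqrt{\frac{q}{n} \log\frac{en}{q}}, \tag{3.6}$$

$$\max_{\mu \in \mathbb{R}^n} \left( \mathbb{E}_\mu \|\hat\mu^{Q\text{-conv}} - \mu\|^2 - \min_{u \in \mathcal{S}_q^{C}} \|u - \mu\|^2 \right) \le \frac{c\sigma^2 q}{n} \log\frac{en}{q}, \tag{3.7}$$

where $c > 0$ is an absolute constant. The same bounds hold for the minimax risks over $\mathcal{S}_q^{C}$
since the minimax risk is smaller than the minimax regret.

The following proposition shows that the rates of convergence in (3.6) and (3.7) are
optimal up to logarithmic factors. We omit the discussion since it is similar to that after
Proposition 4.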


Proposition 7 Let $n \ge 3$. There exist absolute constants $c, c' > 0$ such that, for any
positive integer $q \le n$,

$$\inf_{\hat\mu} \sup_{\mu \in \mathcal{S}_q^{C}} P_\mu\left( \|\hat\mu - \mu\|^2 \ge \frac{c\sigma^2 q}{n} \right) > c', \tag{3.8}$$

where the infimum is taken over all estimators.
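Proof Assume that $q \ge 2$ since for $q = 1$ the result is trivial. We also assume for simplicity
that $n$ is a multiple of $q$. Let $m = n/q$ and $\gamma = (1/8)\sqrt{\sigma^2 q/n}$. Set $\beta_0 = 0$, $\alpha_0 = 0$ and
define, for all integers $j \ge 1$,

$$\beta_j = \beta_{j-1} + \gamma + m\alpha_{j-1}, \qquad \alpha_j = 2\gamma + \alpha_{j-1}. \tag{3.9}$$

By the Varshamov-Gilbert bound (Tsybakov, 2009, Lemma 2.9) there exists $\Omega \subset \{0,1\}^q$
such that (2.21) is satisfied, with $k$ replaced by $q$. For each $\omega \in \Omega$, define a vector $u^\omega \in \mathbb{R}^n$
with components

$$u^\omega_{jm+i} = \omega_{j+1}\gamma + \alpha_j (i - 1) + \beta_j, \qquad j = 0, \dots, q-1, \quad i = 1, \dots, m.$$

The sequence $u^\omega$ is piecewise linear. It is linear with slope $\alpha_j$ on the set $\{jm+1, \dots, (j+1)m\}$
for any $j = 0, \dots, q-1$. Thus, $q(u^\omega) = q$. Next, we prove that $u^\omega \in \mathcal{S}^{C}$ for all $\omega \in \Omega$. It is
enough to check the convexity condition at the endpoints of the linear pieces:

$$2u^\omega_{jm} \le u^\omega_{jm-1} + u^\omega_{jm+1}, \qquad 2u^\omega_{jm+1} \le u^\omega_{jm} + u^\omega_{jm+2}, \tag{3.10}$$

for all $j = 1, \dots, q-1$. Using (3.9) we get that, for all $j = 1, \dots, q-1$,

$$\begin{aligned}
u^\omega_{jm+1} - u^\omega_{jm} &= \omega_{j+1}\gamma + \beta_j - (\omega_j \gamma + \alpha_{j-1}(m - 1) + \beta_{j-1}) \\
&= (\omega_{j+1} - \omega_j + 1)\gamma + \alpha_{j-1} \\
&= (\omega_{j+1} - \omega_j - 1)\gamma + \alpha_j.
\end{aligned}$$

Hence, $\alpha_{j-1} \le u^\omega_{jm+1} - u^\omega_{jm} \le \alpha_j$. Since also $\alpha_{j-1} = u^\omega_{jm} - u^\omega_{jm-1}$ and $\alpha_j = u^\omega_{jm+2} - u^\omega_{jm+1}$,
it follows that the two inequalities (3.10) hold, for all $j = 1, \dots, q-1$. Thus, $u^\omega \in \mathcal{S}^{C}$. In
summary, we have proved that $u^\omega \in \mathcal{S}_q^{C}$ for all $\omega \in \Omega$.

Now, from the Varshamov-Gilbert bound, cf. (2.21), for any two distinct $\omega, \omega' \in \Omega$ we have

$$\|u^\omega - u^{\omega'}\|^2 = \frac{\gamma^2}{q} d_H(\omega, \omega') \ge \frac{\gamma^2}{8} = \frac{\sigma^2 q}{512 n}, \tag{3.11}$$

where $d_H(\cdot, \cdot)$ is the Hamming distance. Finally, similarly to (2.23), the Kullback-Leibler
divergence between $P_\omega$ and $P_0$ satisfies $K(P_\omega, P_0) \le \log(|\Omega| - 1)/16$. Applying (Tsybakov, 2009,
Theorem 2.7) with $\alpha = 1/16$ completes the proof. ■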


4. Concluding remarks and discussion 

In this short note, we have shown that the estimators $\hat\mu^Q$ and $\hat\mu^{Q\text{-conv}}$ based on sparsity
pattern aggregation (in its Q-aggregation version) achieve oracle inequalities that improve
on some previous results for isotonic and convex regression.

One of the improvements is that oracle inequalities (2.6) and (3.5) are sharp, i.e., with
leading constant 1, and they are valid for all $\mu \in \mathbb{R}^n$. This allows us to obtain bounds for the
minimax regret under arbitrary model misspecification, which was not possible based on
the previous results. We show that these bounds are rate optimal up to logarithmic factors.
The question of whether the least squares estimators under monotonicity and convexity
constraints can achieve sharp oracle inequalities with correct rates remains open.

Another improvement is that we obtain oracle inequalities both with high probability 
and in expectation, which was not the case in the previous work. 

An advantage of the least squares estimator is that it requires no tuning parameters.
In particular, the knowledge of $\sigma^2$ is not needed to construct the estimators $\hat\mu^{LS}(\mathcal{S}^\uparrow)$ and
$\hat\mu^{LS}(\mathcal{S}^{C})$. In contrast, the construction of the $\ell_1$-penalized estimator (1.7) and of the
estimators $\hat\mu^Q$ and $\hat\mu^{Q\text{-conv}}$ requires the knowledge of $\sigma^2$. For the penalized estimator
(1.7), the issue may be addressed by using a scale-free version of the Lasso Belloni et al.
(2014); Sun and Zhang (2012). For the Q-aggregation estimators $\hat\mu^Q$ and $\hat\mu^{Q\text{-conv}}$, one can
treat the issue of unknown $\sigma$ as in Bellec (2014). Namely, it is shown in Bellec (2014) that
the oracle inequalities for Q-aggregation procedures are essentially preserved after plugging
in an estimator $\hat\sigma^2$ of $\sigma^2$ that satisfies $|\hat\sigma^2/\sigma^2 - 1| \le 1/8$ with high probability, which is even
weaker than consistency.

Finally, note that instead of Q-aggregation we could have used sparsity pattern aggregation
by the Exponential Screening procedure of Rigollet and Tsybakov (2011). This
would lead to sharp oracle inequalities in expectation of the form (2.6) and (3.5) but not
to inequalities with high probability such as (2.5) and (3.4). This is the reason why we
have opted for Q-aggregation rather than for Exponential Screening in this paper. On the
other hand, Exponential Screening estimators are computationally more attractive than
Q-aggregation since they can be successfully approximated by MCMC algorithms (see Rigollet
and Tsybakov (2011, 2012) for details).